Home Credit Default Risk

Team Members

image.png

1.0 FP Group 11 HCDR

1.1 Phase Leader Plan

image.png

1.2 Credit Assignment Plan

image.png image-2.png

image-3.png image-6.png

1.3 Abstract

The main objective of this project is to build the machine learning model that best predicts whether a loan applicant will be able to repay the loan. In Phase 1 we examined the data and completed our initial baseline models. In Phase 2 we expanded that work and applied fundamental feature engineering, incorporating promising features from the other tables and constructing base models. On these data, logistic regression performed best, with the decision tree and random forest second and third, respectively.

In Phase 3, we carefully selected from the generated features, analyzed feature importance, and applied hyperparameter tuning. We improved upon our baseline models with feature engineering, yielding four additional features: Salary-to-Credit Ratio, Total External Source, AMT Credit-to-Annuity Ratio, and Annuity-to-Salary Ratio. We performed hyperparameter tuning on the baseline models, also applied Lasso and Ridge, and achieved a test accuracy of 92.13% for the Decision Tree and an AUC of 71.8% for Lasso and Ridge. Our Kaggle scores were 0.71673 (private) and 0.73086 (public).

In Phase 4, we applied a deep learning algorithm, the Multi-Layer Perceptron, to predict the project's final target. The primary objectives of this phase were implementing the MLP and visualizing the training runs on TensorBoard. Our accuracy increased to 92.4%, with a testing AUC of 60.13%. We found a slight leakage in the pipeline, which we believe is why our Kaggle score dropped to 0.504 despite the testing AUC of 0.6013.

1.4 Data and Task Description

image-7.png

1.5 Gantt Chart

image-4.png

1.6 Machine Learning Algorithms and Metrics

The goal of this project is to predict whether the customer will repay the loan, so this is a binary classification task with outcome 0 or 1. To solve it, we will build the following machine learning models:

  1. Logistic Regression:
    • In our case the number of features is relatively small (fewer than 1,000) while the number of examples is large, so logistic regression is a good fit for this classification task.

  2. Decision Tree:
    • Decision trees handle categorical data well, and our target is categorical in nature, so a decision tree is a good fit.

  3. Random Forest:
    • Random forests work well with a mixture of numerical and categorical features.
    • Since our dataset contains a good mix of both feature types, a random forest can be a good fit.

  4. Lasso Regression:
    • The bias-variance trade-off is the basis for Lasso's advantage over least squares. When the variance of the least-squares estimates is very large, the lasso solution can reduce variance at the cost of a slight increase in bias, which can produce more accurate predictions.

  5. Ridge Regression:
    • Ridge regression is a model-tuning technique suited to any data that exhibits multicollinearity; it performs L2 regularization. When multicollinearity arises, the least-squares estimates are unbiased but their variances are large, so predicted values can differ greatly from the real values.

  6. Multi-Layer Perceptron:
    • A perceptron defines a linear decision boundary for binary classification: it finds the separating hyperplane that reduces the distance between misclassified points and the decision boundary. A multi-layer perceptron consists of input and output layers plus one or more densely connected hidden layers.
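The models above can be sketched with scikit-learn; this is a minimal illustration on synthetic stand-in data, not our actual pipeline (the HCDR features and preprocessing are omitted).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HCDR features (binary target: repaid or not).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Fit each model and record its held-out accuracy.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

Lasso, Ridge, and the MLP follow the same fit/score pattern with their respective estimators.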

1.6.1 Loss Function
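For this binary classification task we track the log loss (binary cross-entropy), which is the loss we report for the tuned models later. A minimal NumPy sketch of the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Log loss for binary targets: -mean(y*log(p) + (1-y)*log(1-p)).

    Predictions are clipped away from 0 and 1 to avoid log(0).
    """
    p = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.2])  # predicted P(default)
loss = binary_cross_entropy(y_true, y_pred)
```

Confident correct predictions drive the loss toward 0, while confident wrong ones are penalized heavily.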

1.6.2 Metrics

  1. Confusion Matrix:
    • A confusion matrix, also called an error matrix, is used in machine learning, specifically for classification problems. It tabulates counts of predicted versus actual values. "TN" (True Negative) is the number of negative cases correctly classified; "TP" (True Positive) is the number of positive cases correctly classified; "FP" (False Positive) is the number of actual negatives mistakenly classified as positive; and "FN" (False Negative) is the number of actual positives mistakenly classified as negative. Accuracy, derived from these counts, is one of the most commonly used classification metrics.

image.png

  2. AUC:
    • AUC stands for "Area Under the ROC Curve." It measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1) and is a widely used performance metric for binary classification problems.
  3. Accuracy:
    • The accuracy score gauges the model's effectiveness as the ratio of correct predictions (true positives plus true negatives) to all predictions made. Accuracy is commonly used to evaluate binary classification models.

1.7 Machine Learning Pipeline Steps

image-2.png

1.8 Block Diagram

image.png

Previous Experiments

Phase 2: Baseline

We built the baseline models as part of Phase 2.

Below are the experiments we conducted in Phase 2 for baseline models.

Screen%20Shot%202022-12-13%20at%201.47.49%20AM.png

Phase 3: Hyper-Parameter Tuning

As part of Phase 3, we hyperparameter-tuned the models used in Phase 2.

During this phase we implemented several algorithms: Logistic Regression, Decision Tree, Random Forest, Lasso Regression, and Ridge Regression.

We conducted multiple experiments as part of Phase 3; below are some of them.

image.png

image.png

image.png
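The Phase 3 hyperparameter search can be sketched with GridSearchCV; the grid below mirrors the decision tree parameters we report later (gini criterion, max_depth, min_samples_leaf) but runs on synthetic stand-in data, not the HCDR tables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered HCDR feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 4],
}
# 5-fold cross-validated grid search scored by ROC AUC, matching our metric.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
best = search.best_params_
```

The same pattern was applied to the other models, swapping in each estimator and its parameter grid.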

Phase 4: MLP

FETCHING IMPORTANT RELEVANT FEATURES

Note: We received a "port already in use" error, so we ran the algorithm on a different machine to generate the TensorBoard report. The results are attached below.

TensorBorad_Relu_Sigmoid.png


TensorBoard_ReLU.png


TensorBoard_Sigmoid.png
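Our Phase 4 model was a PyTorch MLP; as a compact stand-in, the same architecture idea (stacked hidden layers with ReLU activations on scaled inputs) can be sketched with scikit-learn's MLPClassifier. The hidden-layer sizes here are illustrative, not our tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HCDR features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale inputs, then fit a two-hidden-layer perceptron with ReLU activations.
scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)
test_acc = mlp.score(scaler.transform(X_test), y_test)
```

In the actual PyTorch version, the training loop additionally logs loss and accuracy per epoch for TensorBoard.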

Result and Discussion

In Phase 1, we finalized the machine learning models and project tracking tools, allocated tasks to team members, and built the phase leader plan, the credit assignment plan, and the Gantt chart. We also performed some basic exploration of the dataset.


In Phase 2 we built baseline versions of all the selected models and recorded their training and testing accuracies.


In Phase 3 we performed feature engineering and hyperparameter tuning and reran the tuned models. We added two new models, Lasso Regression and Ridge Regression, which gave us an improved ROC AUC of 75.57% over the baseline models.

  1. The hyper-tuned decision tree model's train (92.19%) and test (92.13%) accuracy rose greatly compared to its baseline, indicating that it performs well on the input dataset.
  2. The overall accuracy of the decision tree grew significantly, reaching 92%. The hyper-tuned decision tree is the best-fit algorithm, since it surpasses the other models in Phase 3.
  3. We saw an improvement of 6% in test accuracy and 6% in overall accuracy. The decision tree outperforms the other models since its log loss (0.24) is low and greatly reduced compared to its baseline.
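The Phase 3 ratio features can be sketched with pandas; the input column names follow the HCDR application table (AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY), but the helper and the engineered column names are illustrative.

```python
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add engineered ratio features of the kind described above (sketch)."""
    out = df.copy()
    out["CREDIT_TO_ANNUITY_RATIO"] = out["AMT_CREDIT"] / out["AMT_ANNUITY"]
    out["ANNUITY_TO_INCOME_RATIO"] = out["AMT_ANNUITY"] / out["AMT_INCOME_TOTAL"]
    out["INCOME_TO_CREDIT_RATIO"] = out["AMT_INCOME_TOTAL"] / out["AMT_CREDIT"]
    return out

sample = pd.DataFrame({
    "AMT_INCOME_TOTAL": [200000.0, 100000.0],
    "AMT_CREDIT": [400000.0, 250000.0],
    "AMT_ANNUITY": [20000.0, 12500.0],
})
features = add_ratio_features(sample)
```

These derived columns are then fed to the models alongside the raw features.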

In this phase we implemented deep learning with a multi-layer perceptron, and we used TensorBoard to visualize the loss and accuracy of the training model. We improved our Kaggle submission from 0.5 to 0.71, roughly a 42% improvement. Hyperparameter tuning helped improve the test accuracy. Our accuracy for the multi-layer perceptron came out to 92.4%, which is quite good for the given data.
We performed various experiments; a brief summary follows.
We conclude that our best models are the Decision Tree, with accuracy on the higher side (92.03%) but a ROC score of only 53.58%, and Lasso Regression, with accuracy on the lower side but a higher ROC score of 75.57%. We also built a deep learning model, a Multi-Layer Perceptron, using PyTorch; its test accuracy was on the higher side, but its ROC score (60.13%) was lower than the baseline models'.

  1. Logistic Regression: In Phase 2 we ran a baseline logistic regression with limited feature engineering, which gave us a train accuracy of 92% and a test accuracy of 91.9%. In Phase 3, after feature engineering and multiple experiments, it gave us a train accuracy of 91.91% and a test accuracy of 91.98%.

  2. Random Forest: In Phase 2 we ran a baseline random forest with limited feature engineering, which gave us a train accuracy of 92.2% and a test accuracy of 92.2%. In Phase 3, after feature engineering and multiple experiments, it gave us a train accuracy of 91.91% and a test accuracy of 91.98%.

  3. Decision Tree: In Phase 2 we ran a baseline decision tree with limited feature engineering, which gave us a train accuracy of 86.4% and a test accuracy of 86.1%. In Phase 3, after feature engineering and multiple experiments, it gave us a train accuracy of 91.13% and a test accuracy of 92.13%.

  4. Lasso Regression: In Phase 3, after hyperparameter tuning and additional feature engineering, it gave us training and testing accuracy on the lower side but a higher ROC of 75.57%.

  5. Ridge Regression: In Phase 3, after hyperparameter tuning and additional feature engineering, it gave us training and testing accuracy on the lower side but a higher ROC of 75.68%.

  6. Deep Learning Model: In Phase 4 we built a multi-layer perceptron and experimented with different activation functions (ReLU, Sigmoid, etc.) and numbers of epochs. Our best testing accuracy was 93.65%, with a ROC of 60.13%.

Conclusion

The major objective of this project is to develop a machine learning model that can forecast a loan applicant's ability to repay a loan. Without statistical analysis, many deserving applicants with no credit history or default history are accepted. Our work trains the machine learning model on the HCDR dataset. A user's average, minimum, and maximum balances, reported Bureau scores, salary, and other factors are used to generate a credit history, which serves as a gauge of their reliability.

We improved our classification model through feature engineering, feature selection, and hyperparameter tuning to more precisely forecast whether a loan applicant will be able to repay the loan. As part of this project, we developed machine learning pipelines, undertook exploratory data analysis on the datasets provided by Kaggle, and evaluated the models using several evaluation measures before deploying one.

During Phase 2 we concluded that the best model for this dataset was logistic regression, with the highest accuracy of 91.9%. In Phase 3, we expanded our feature engineering and added more relevant features, such as the AMT Credit-to-Annuity Ratio and the Annuity-to-Salary Ratio. We also discarded features with more than 30% null values, unlike in Phase 2. Thanks to hyperparameter tuning, both our accuracy and AUC increased significantly. The decision tree turned out to be the best model in Phase 3, with best parameters gini criterion, max_depth = 4, and min_samples_leaf = 4, achieving an accuracy of 92.13%. Lasso also turned out to be the best model in terms of AUC (0.71).
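Discarding features with more than 30% null values, as described above, can be sketched with pandas; the threshold and helper name are illustrative.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_null_frac: float = 0.30) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is <= max_null_frac."""
    keep = df.columns[df.isna().mean() <= max_null_frac]
    return df[keep]

demo = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0],
    "half_missing": [1.0, np.nan, np.nan, 4.0],  # 50% null, gets dropped
})
cleaned = drop_sparse_columns(demo)
```

`df.isna().mean()` gives the per-column null fraction directly, so the filter is a single boolean mask over the columns.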

As part of Phase 4, we implemented a Multi-Layer Perceptron with ReLU and Sigmoid activation functions and multiple hidden-layer configurations, and found that ReLU and Sigmoid together give the most effective result. The Multi-Layer Perceptron's accuracy on the provided dataset, using PyTorch, was 93.65%, which is quite effective and generally good. The test AUC for the MLP fluctuated and came down to 0.6 after multiple experiments.

Taking all experiments into consideration, the Decision Tree and Lasso Regression turned out to be the best-performing models. Additional experimentation on the MLP model could have yielded better results.

Bibliography